library(readr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(leaflet)
library(geojsonio)
library(htmltools)
library(htmlwidgets)

Colombia COVID-19

LINK: https://www.kaggle.com/camesruiz/colombia-covid19-complete-dataset

DESCRIPTION: Coronavirus (COVID-19) made its outbreak in Colombia with the first confirmed in the country on march 6th, since then, number of confirmed cases has been increasing and deaths related to the virus are starting to have the first confirmed cases. This data set contains complete information about confirmed cases, deaths and number of recovered patients according to the daily reports by the colombian health department (Ministerio de Salud)

GOAL: Build a model for the number of confirmed cases in the different Colombia regions. You have the access to some covariates, such as: Edad (age), Sexo (Sex), Pais de procedencia (origin country) of the individual cases. Try to test the predictive accuracy of your selected model.

ATTENTION: Three countries are here considered: Colombia, Mexico and India. Each different group of students should focus on a geographical sub-area of one of the three countries, say the northern, the central or the southern part of the countries, by pooling all the regions/states/departments belonging to the considered area. Say: group A focuses on Northern Mexico, group B on Central Mexico, and so on. The distinction in northern, central and southern is not strict, you have some flexibility.


Our Project

We decided to do central Colombia because it is where the capital is.

The largest cities in the country are Bogotá (in the center), Medellín (in the north, close to central), Cali (in the center) and Barranquilla (extreme north).

Dataset - First Exploration

colombia_covid <- read_csv("data/colombia_covid.csv")
unique(colombia_covid$`Departamento o Distrito`)
##  [1] "Bogotá D.C."           "Valle del Cauca"       "Antioquia"            
##  [4] "Cartagena D.T. y C"    "Huila"                 "Meta"                 
##  [7] "Risaralda"             "Norte de Santander"    "Caldas"               
## [10] "Cundinamarca"          "Barranquilla D.E."     "Santander"            
## [13] "Quindío"               "Tolima"                "Cauca"                
## [16] "Santa Marta D.T. y C." "Cesar"                 "San Andrés"           
## [19] "Casanare"              "Nariño"                "Atlántico"            
## [22] "Boyacá"                "Córdoba"               "Bolívar"              
## [25] "Sucre"                 "La Guajira"
#options(tibble.print_max = Inf) # to show all the elements of the list, but set it also for other chunks
#options(tibble.width = Inf)
colombia_covid %>%
  group_by(`Departamento o Distrito`) %>%
  count()
## # A tibble: 26 x 2
## # Groups:   Departamento o Distrito [26]
##    `Departamento o Distrito`     n
##    <chr>                     <int>
##  1 Antioquia                   127
##  2 Atlántico                     4
##  3 Barranquilla D.E.            31
##  4 Bogotá D.C.                 542
##  5 Bolívar                       3
##  6 Boyacá                        6
##  7 Caldas                       16
##  8 Cartagena D.T. y C           39
##  9 Casanare                      2
## 10 Cauca                        12
## # … with 16 more rows

Colombia is divided into 32 departments. According to Wikipedia we miss the Departments of Amazonas, Arauca, Caquetá, Chocó, Guainía, Guaviare, Magdalena, Putumayo, Vaupés, Vichada.

Bogotá, Distrito Capital in in the Cundinamarca Department. Barranquilla D.E. is a “Distrito Especial” but should be in the Atlántico Department.

The Districts (Spanish: Distrito) in Colombia are cities that have a feature that highlights them, such as its location and trade, history or tourism. Arguably, the districts are special municipalities. The districts are Bogotá, Barranquilla, Cartagena, Santa Marta, Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.

We miss Cúcuta, Popayán, Tunjaa, Buenaventura, Turbo and Tumaco.

#lat-long
#bogota<c(4.592164298, -74.072166378, 542)
valle_de_cauca<-c("Valle del Cauca", 3.4372200, -76.5225000, 150)
cauca<-c("Cauca", 2.43823, -76.6131592, 12) 
antioquia<-c("Antioquia",6.2518400, -75.5635900, 127)
cartagena<-c("Cartagena D.T. y C", 10.39972, -75.51444, 39)
huila<-c("Huila", 2.9273, -75.2818909, 30)
meta<-c("Meta", 4.1420000, -73.6266400, 12)
risaralda<-c("Risaralda", 4.8133302, -75.6961136, 35)
norte_santander<-c("Norte de Santander", 7.8939100, -72.5078200, 21)
caldas<-c("Caldas", 5.0688900, -75.5173800, 16)
cudinamarca<-c("Cundinamarca", 4.862437, -74.058655, 42)
barraquilla<-c("Barranquilla D.E.", 10.9685400, -74.7813200, 35) #atlantico
santader<-c("Santander", 7.1253900, -73.1198000, 12)
quindio<-c("Quindío", 4.535000, -75.675690, 23)
tolima<-c("Tolima", 4.43889, -75.2322235, 14)
santa_marta<-c("Santa Marta D.T. y C.", 11.24079, -74.19904, 12)
cesar<-c("Cesar", 10.4631400, -73.2532200, 16)
san_andres<-c("San Andrés", 12.5847197, -81.7005615, 2)
casanare<-c("Casanare", 5.3377500, -72.3958600, 2)
narino<-c("Nariño", 1.2136100, -77.2811100, 6)
boyaca<-c("Boyacá", 5.767222, -72.940651, 6)
cordoba<-c("Córdoba", 8.7479800, -75.8814300, 2)
bolivar<-c("Bolívar", 10.3997200, -75.5144400, 3)
sucre<-c("Sucre", 9.3047199, -75.3977814, 1)
guajira<-c("La Guajira", 11.5444400, -72.9072200, 1)

gis_data<-data.frame(name="Bogotá D.C.", latitude=4.624335, longitude=-74.063644, cases=542) #bogotà
gis_data$name<-as.character(gis_data$name)
gis_data<-rbind(gis_data, cauca)
gis_data<-rbind(gis_data, valle_de_cauca)
gis_data<-rbind(gis_data, antioquia)
gis_data<-rbind(gis_data, cartagena)
gis_data<-rbind(gis_data, huila)
gis_data<-rbind(gis_data, meta)
gis_data<-rbind(gis_data, risaralda)
gis_data<-rbind(gis_data, norte_santander)
gis_data<-rbind(gis_data, caldas)
gis_data<-rbind(gis_data, cudinamarca)
gis_data<-rbind(gis_data, barraquilla)
gis_data<-rbind(gis_data, santader)
gis_data<-rbind(gis_data, quindio)
gis_data<-rbind(gis_data, tolima)
gis_data<-rbind(gis_data, santa_marta)
gis_data<-rbind(gis_data, cesar)
gis_data<-rbind(gis_data, san_andres)
gis_data<-rbind(gis_data, casanare)
gis_data<-rbind(gis_data, narino)
gis_data<-rbind(gis_data, boyaca)
gis_data<-rbind(gis_data, cordoba)
gis_data<-rbind(gis_data, bolivar)
gis_data<-rbind(gis_data, sucre)
gis_data<-rbind(gis_data, guajira)

gis_data$latitude<-as.numeric(gis_data$latitude)
gis_data$longitude<-as.numeric(gis_data$longitude)
gis_data$cases<-as.numeric(gis_data$cases)
getColor <- function(gis_data) {
  sapply(gis_data$cases, function(cases) {
  if(cases <= 10) {
    "green"
  } else if(cases <= 100) {
    "orange"
  } else {
    "red"
  } })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(gis_data)
)

dept<-geojsonio::geojson_read("data/Colombia.geo.json", what="sp")

labels <- sprintf(
  "<strong>%s</strong><br/>",
  dept$NOMBRE_DPT
) %>% lapply(htmltools::HTML)

m <- leaflet(data=gis_data) %>% addTiles() %>%
  addAwesomeMarkers(~longitude, ~latitude, icon=icons, label = ~as.character(cases)) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data=dept,     
              opacity = 0.2,
              color = "yellow",
              dashArray = "3",
              fillOpacity = 0.1,
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 0.7,
                bringToFront = TRUE),
                label = labels)
m
#saveWidget(m, file="figs/colombia.html")

The color of the pins is related with the number of cases: if they are less than 10 the color is “green”, if they are less than 100 the color is “orange”, otherwise it is “red”.
On the map there are all the cities/departments for which we have data. We can notice that we don’t have any data in the south of the country.

Reading here and there I found that Colombia in divided in 5 regions, the central one comprises: Boyacá, Tolima, Cundinamarca, Meta, Bogotà DC.

ANGELA: Seeing Wikipedia I think that the Orinoquía Region (Meta, Arauca, Casanare and Vichada Departments) is in the center, so I would add also Arauca, Casanare and Vichada. I noticed that we only have Casanare, the other two doesn’t have data.

GAIA: added Casanare, for the others we have no data!

However, since in our assignment Colombia is divided in 3 parts, I think that we should add some more regions (e.g. Quindío, Valle del Cauca, Risaralda, Celdas, Boyacá and possibly Antioquia and Santander)

#slice the main dataset
central.colombia.dep<-c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá", "Quindío", "Cauca", "Valle del Cauca", "Risaralda", "Caldas", "Boyacá", "Antioquia", "Santander", "Casanare")
central.colombia.rows<-which(colombia_covid$`Departamento o Distrito` %in% central.colombia.dep)
colombia_covid<-colombia_covid[central.colombia.rows,]

#slice the gis_data dataset
central_gis_data<-gis_data[which(gis_data$name %in% central.colombia.dep),]
#slice the geojson dataset 
central_dept<-c("SANTAFE DE BOGOTA D.C", "TOLIMA", "CUNDINAMARCA", "META", "BOYACA", "QUINDIO", "CAUCA", "VALLE DEL CAUCA", "RISARALDA", "CALDAS", "BOYACA", "ANTIOQUIA", "SANTANDER", "CASANARE")
central.dept<-dept[which(dept@data$NOMBRE_DPT %in% central_dept),]
getColor <- function(central_gis_data) {
  sapply(central_gis_data$cases, function(cases) {
  if(cases <= 10) {
    "green"
  } else if(cases <= 100) {
    "orange"
  } else {
    "red"
  } })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(central_gis_data)
)
#now colombia is yellow and our departments are red
central_map <- leaflet(data=central_gis_data) %>% addTiles() %>%
  addAwesomeMarkers(~longitude, ~latitude, icon=icons, label = ~htmlEscape(paste(name,cases,sep=" : "))) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data=dept,
              opacity=0.1,
              color="yellow",
              fillOpacity = 0.1) %>%
  addPolygons(data=central.dept,     
              opacity = 0.2,
              color = "red",
              dashArray = "3",
              fillOpacity = 0.1)
central_map
#saveWidget(central_map, file="figs/central_colombia.html")

Some very basics plots

Let’s check the situation (and also the power of ggplot)!

Scattered infos about pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):

  • the quarantine started on the 20th of March, since our data are from 6th of March to 2nd of April, it is very likeliy that quarantine effects are not witnessed in our data.

  • on March the 26th there was a damage in the machine that prepared the samples for processing and subsequent diagnosis of COVID-19, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases.

theme_set(theme_classic())
region<-colombia_covid$`Departamento o Distrito`
nrows<-10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(region) * ((nrows*nrows+1)/(length(region))))
df$category<-factor(rep(names(categ_table), categ_table))
waffle_chart <- ggplot(df, aes(x = x, y = y, fill = df$category)) + 
        geom_tile(color = "black", size = 0.5) +
        scale_x_continuous(expand = c(0, 0)) +
        scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
        scale_fill_brewer(palette = "Set3") +
        labs(title="Frequency of cases across Departments", subtitle="Waffle Chart") + 
        theme(#panel.border = element_rect(size = 2),
        plot.title = element_text(size = rel(1.2)),
              axis.text = element_blank(),
              axis.title = element_blank(),
              axis.ticks = element_blank(),
              legend.title = element_blank(),
              legend.position = "right")
waffle_chart

The major number of cases are in the capital Bogotà.

theme_set(theme_classic())
#number of cases confirmed day by day

#fix day column in "international" format so that R can fix the sorting properly
colombia_covid$`Fecha de diagnóstico`<-as.Date(colombia_covid$`Fecha de diagnóstico`, format="%d/%m/%Y")

colorCount<-length(unique(colombia_covid$`Departamento o Distrito`)) 
getPalette<-colorRampPalette(brewer.pal(9, "Spectral"))(colorCount)

daily_confirmed<-ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
  scale_fill_manual(values = getPalette)
daily_confirmed + geom_histogram(aes(fill=`Departamento o Distrito`), width = 0.8, stat="count") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6),
        legend.position = "right") +
  labs(title = "Daily number of confirmed cases", 
       subtitle = "subdivided across departments",
       x = "Date of confirmation",
       fill = "Department")

The previous plot represents the daily incidence of the desease across all the departments we are taking into account.

Let’s check the general trend by looking at the cumulative number of confirmed cases (again, all “our” departments are taken into account):

#the max of the ID de caso for each date is the cumulative number of cases confirmed up to that date
cumulative<- as.data.frame(colombia_covid %>%
  group_by(`Fecha de diagnóstico`) %>%
  summarise(max(`ID de caso`)))
  
names(cumulative)<-c("Fecha de diagnóstico", "Cumulative confirmed")
cumulative
##    Fecha de diagnóstico Cumulative confirmed
## 1            2020-03-06                    1
## 2            2020-03-09                    3
## 3            2020-03-11                    9
## 4            2020-03-12                   11
## 5            2020-03-13                   16
## 6            2020-03-14                   24
## 7            2020-03-15                   45
## 8            2020-03-16                   57
## 9            2020-03-17                   75
## 10           2020-03-18                  102
## 11           2020-03-19                  128
## 12           2020-03-20                  175
## 13           2020-03-21                  210
## 14           2020-03-22                  240
## 15           2020-03-23                  306
## 16           2020-03-24                  419
## 17           2020-03-25                  481
## 18           2020-03-26                  491
## 19           2020-03-27                  539
## 20           2020-03-28                  603
## 21           2020-03-29                  700
## 22           2020-03-30                  798
## 23           2020-03-31                  906
## 24           2020-04-01                 1065
## 25           2020-04-02                 1161
cumulative_confiremd<-ggplot(cumulative, aes(x=`Fecha de diagnóstico`, y=`Cumulative confirmed`))

cumulative_confiremd + geom_point(size=3) +
  geom_segment(aes(x=`Fecha de diagnóstico`,
                   xend=`Fecha de diagnóstico`,
                   y=0,
                   yend=`Cumulative confirmed`)) +
  labs(title = "Cumulative number of confirmed cases",
       x = "Date of confirmation") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).

In order to confirm it we should fit a log-linear model, and check that it produces a constant growth rate (straight line).

Now let’s explore the distribution of cases across genders and age:

library(ggthemes)
brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))

ggplot(data=colombia_covid, aes(x=`Departamento o Distrito`, fill = Sexo)) +  
                              geom_bar(data = subset(colombia_covid, Sexo == "F")) +
                              geom_bar(data = subset(colombia_covid, Sexo == "M"), aes(y=..count..*(-1))) + 
                              scale_y_continuous(breaks = brks,
                                               labels = lbls) + 
                              coord_flip() +  
                              labs(title="Spread of the desease across genders",
                                   y = "Number of cases",
                                   x = "Department",
                                   fill = "Gender") +
                              theme_tufte() +  
                              theme(plot.title = element_text(hjust = .5), 
                                    axis.ticks = element_blank()) +   
                              scale_fill_brewer(palette = "Dark3")  

Maybe in order to study the distribution of ages we should divide the ages in groups, for example 0-18, 18-30, 30-45, 45-60, 60-75, 75+.

#create column age_group (didn't modify the original dataset though)
age_groups<-colombia_covid %>% mutate(age_group = case_when(Edad <= 18 ~ '0-18',
                                             Edad >= 19  & Edad <= 30 ~ '19-30',
                                             Edad >=  31 & Edad <= 45 ~ '31-45',
                                             Edad >= 46 & Edad <= 60 ~ '46-60',
                                             Edad >=61 & Edad <= 75 ~ '60-75',
                                             Edad >=76 ~ '76+'))

#compute percentage so that we can label more precisely the pie chart
age_groups_pie <- age_groups %>% 
  group_by(age_group) %>%
  count() %>%
  ungroup() %>%
  mutate(per=`n`/sum(`n`)) %>% 
  arrange(desc(age_group))
age_groups_pie$label <- scales::percent(age_groups_pie$per)

library(ggrepel)

age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(age_group))) + 
  geom_bar(stat="identity", width = 1) +
  theme(axis.line = element_blank(), 
        plot.title = element_text(hjust=0.5)) + 
  labs(fill="Age groups", 
       x=NULL, 
       y=NULL, 
       title="Distribution of the desease across ages") +
  coord_polar(theta = "y") +
  #geom_text(aes(x=1, y = cumsum(per) - per/2, label=label))
  geom_label_repel(aes(x=1, y=cumsum(per) - per/2, label=label), size=3, show.legend = F, nudge_x = 0) +
  guides(fill = guide_legend(title = "Group"))
  
age_pie 

This is quite surprising.. I expected elder people to be more affected by Covid-19!

The overall life expectancy in Colombia at birth is 74.8 years (71.2 years for males and 78.4 years for females). Wikipedia

Instead, the median age of the population in 2015 was 29.5 years (30.4 in 2018, 31.3 in 2020), so it is a Country full of young people! link or link or link

Now we can analyze jointly the distribution of age and sex (sex distribution across group of age):

#library(ggpol)

keep <- c("Sexo", "age_group")
age_groups<-age_groups[names(age_groups)%in%keep]
age_groups_count<-aggregate(cbind(pop=Sexo)~age_group+Sexo,
                      data=age_groups,
                      FUN = function(x){NROW(x)})
age_groups_count$count <- ifelse(age_groups_count$Sexo == "F", age_groups_count$pop * -1, age_groups_count$pop)
age_groups_count<-as.data.frame(age_groups_count[names(age_groups_count)!="pop"])

ggplot(age_groups_count, aes(x=age_group, y=count, fill=Sexo)) +
  geom_col() + 
  #facet_share(~Sexo, dir = "h") +
  coord_flip(clip="off") +
  theme_minimal() +
  labs(title = "Distribution of sex by age",
       y = "",
       x = "Age group")

There isn’t much difference between the sexes among the different group of ages, I have the impression that the covariates present in the dataset won’t help us!! :(

We are now left to explore the Tipo variable:

theme_set(theme_classic())
#renamed the attribute since I had sime problem due to the presence of *
colnames(colombia_covid)[8]<-"tipo"

type<-ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
  scale_fill_brewer(palette = "Set3")
type + geom_histogram(aes(fill=tipo), width = 0.8, stat="count") + theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title = "Daily number of confirmed cases", 
       subtitle = "subdivided across type",
       x = "Date of confirmation",
       fill = "Type")

I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:

type_pie<- colombia_covid %>% 
  group_by(tipo) %>%
  count() %>%
  ungroup() %>%
  mutate(per=`n`/sum(`n`)) %>% 
  arrange(desc(tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie<-type_pie[names(type_pie)!="per"]
colnames(type_pie)<-c("tipo", "total_number", "percentage")
type_pie
## # A tibble: 3 x 3
##   tipo        total_number percentage
##   <chr>              <int> <chr>     
## 1 Relacionado          291 29.3%     
## 2 Importado            467 47.0%     
## 3 En estudio           235 23.7%

Almost half of the total confirmed cases in our region of interest are imported, and a significant percentage is anknown wheter it is imported or not. Again this is in some sense interesting, but I don’t see clearly why this should be helpful in our model!

imported<-subset(colombia_covid, tipo=="Importado", select = c("País de procedencia"))
imported %>%
  group_by(`País de procedencia`) %>%
  count() 
## # A tibble: 55 x 2
## # Groups:   País de procedencia [55]
##    `País de procedencia`     n
##    <chr>                 <int>
##  1 0                         1
##  2 Alemania                  4
##  3 Alemania - Estambul       1
##  4 Arabia                    1
##  5 Argentina                 2
##  6 Aruba                     2
##  7 Bélgica                   1
##  8 Brasil                   10
##  9 Canadá                    1
## 10 Chile                     2
## # … with 45 more rows

here data are a bit dirty, however I don’t know if the effort of cleansing them will worth the result.. it depends wheter we decide to use this info in our analysis.

Analyze the epidemic curve separately for each department, considering only those that have more than 30 cases:

covid_dp<-colombia_covid %>%
  group_by(`Fecha de diagnóstico`,`Departamento o Distrito`) %>%
  count()
names(covid_dp)<-c("date", "dep", "n")
#departments with more than 30 cases
relevant<-c("Bogotá D.C.", "Valle del Cauca", "Antioquia", "Cundinamarca", "Risaralda")
covid_relevant_dp<-subset(covid_dp, dep %in% relevant)
covid_relevant_dp
## # A tibble: 83 x 3
##    date       dep                 n
##    <date>     <chr>           <int>
##  1 2020-03-06 Bogotá D.C.         1
##  2 2020-03-09 Antioquia           1
##  3 2020-03-09 Valle del Cauca     1
##  4 2020-03-11 Antioquia           3
##  5 2020-03-11 Bogotá D.C.         2
##  6 2020-03-12 Bogotá D.C.         2
##  7 2020-03-13 Bogotá D.C.         1
##  8 2020-03-13 Valle del Cauca     1
##  9 2020-03-14 Antioquia           3
## 10 2020-03-14 Bogotá D.C.         5
## # … with 73 more rows
ggplot(covid_relevant_dp, aes(x=date, y=n, fill=dep)) +
  geom_bar(stat="identity", show.legend = FALSE) +
  facet_grid(dep~., scales="free_y") + #every level has a different count axis
  theme_dark() +
  theme(strip.text = element_text(face="bold", size=6))+
  labs(title = "Epidemic curve for different departments",
       subtitle = "considering only dept. with more than 30 cases",
       x = "Date of confirmation",
       y = "count",
       fill = "Department")

Analyze the curve of cumulative confirmed cases on those “relevant” department:

ggplot(covid_relevant_dp, aes(x=date, y=cumulative_dep, fill=dep)) +
  geom_bar(stat="identity", show.legend = FALSE) +
  facet_grid(dep~., scales="free_y") + #every level has a different count axis
  theme_gray() +
  theme(strip.text = element_text(face="bold", size=6))+
  labs(title = "Cumulative number of cases for different departments",
       subtitle = "considering only dept. with more than 30 cases",
       x = "Date of confirmation",
       y = "count",
       fill = "Department")

We can see that (except for the Bogotà department) we have a lot of “missing” columns, these are the days in which no data was recorded for the corresponding department, the cumulative number of cases is the same as the previous day reported in the dataset. Maybe there is a way to fix this!

Missing

I still didn’t integrate the “other part” of the dataset, the one concerning deaths!

Ideas

For what concerns the predictive model we want to build, I think that we should start by something very simple (e.g. a (log)linear model) and take it as a baseline.

Then we build something more complex (such as a hierarchical model) and see the improvements with respect to the baseline.

If possible I would put inside something bayesian, since I understood that they really like this kind of stuff!

Gaia I start here with my stream of consciousness about the analysis!

Let’s fix some terminology (references for these are the first two links of the section “Interesting Links”):

  • a case is a person who tested positive for Covid-19;
  • the epidemic curve represents the daily incremental incidence;

Our data are tabulated by date of confirmation by lab test.

The variable we want to predict, say y, (dependent variable) is the (cumulative) number of confirmed cases, namely we want to simulate a (hopefully plausible) epidemic trajectory. Eventually we will project future incidence.

Since our y is a discrete count variable, a linear regression is not appropriate.

The most common choice with count data is to use Poisson regression, which belongs to the GLM class of models.

\[ ln\lambda_i = \beta_0 + \beta_1x_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \]

I think that, as first step, we should consider the most parsimonious model, taking only the time as independent variable. Here time can be intended both as the Fecha de diagnostico attribute of our dataset, or as the number of days elapsed from the earlier day in the dataset (in this case we should derive this attribute).

Further steps will include more covariates, and a model selection phase should follow. At the moment we are basically working with time series data.

Important: recall that modelling count data with a Poisson regression we are assuming equidispersion (the mean coincides with the variance), however we have no guarantees that this is true for our data, we need to take this into account!

The days 7/03/20, 8/03/20, 10/03/20 are missing, probably because no case was detected in those days (consistent with the fact that it was the very beginning of the outbreak).

cumulative<-cumulative %>%  
mutate(BETWEEN0=as.numeric(difftime(`Fecha de diagnóstico`,lag(`Fecha de diagnóstico`,1))),BETWEEN=ifelse(is.na(BETWEEN0),0,BETWEEN0),elapsed_time=cumsum(as.numeric(BETWEEN)))%>%
select(-c(BETWEEN0,BETWEEN))
#just for praticality
names(cumulative)<-c("date", "y", "elapsed_time")
#cumulative